The criterion, however, requires the measurement dimension to be smaller than
the sample size. The trace-based criterion, in contrast, is an independence rule
and is effective in the “large dimension-small sample size” scenario. An appealing
property of these two criteria is that their implementation is straightforward
[...] parameters. Their asymptotic misclassification probabilities are derived using the
theory of large dimensional random matrices. Their competitive performances
are illustrated by intensive Monte Carlo experiments and a real data analysis.
1 Introduction

In recent years, a great deal of attention has been paid to the development
of high dimensional classification methods. Many independence rules have been
proposed to deal with situations where the correlations between variables
are weak. Tibshirani et al. (2002) proposed the nearest shrunken centroid
(NSC) classifier. Fan and Fan (2008) proposed the features annealed
independence rule (FAIR). Moreover, Bickel and Levina (2004) showed that the
independence rule, naive Bayes (NB), performs better than the naive Fisher
discriminant (NFR) when the variables are correlated. When the correlations
are significant, NFR performs about as well as random guessing. They also showed
that a classification procedure using a subset of well-selected features is better
than one using all the features, which typically accumulates much noise in
estimating population centroids in high dimensional space.

In addition, methods integrating the covariance structure have been
proposed in the literature, such as support vector machines (Vapnik 1995), shrunken
centroids regularized discriminant analysis (SCRDA) (Guo et al. 2005), sparse
linear discriminant analysis (Shao et al. 2011) and the DDα-procedure (Lange
et al. 2014). More recently, Fan et al. (2012) proposed a new method that
involves correlation information, called the regularized optimal affine discriminant
(ROAD). Interestingly enough, the classification error of the ROAD decreases
as the correlation coefficient increases. Two variants are screening-based rules,
named S-ROAD1 and S-ROAD2, which select only 10 features and 20 features,
respectively. In the simulation study, under the “large p-small n” and
equal correlation setting, the ROAD method outperforms the available classifiers
mentioned above. S-ROAD2 also performs well, while S-ROAD1 fails for
highly correlated variables. Notice that the ROAD and its variants have to select
variables in the procedure of classification. Although variable selection has
been extensively developed in the last decades, its practical implementation still
faces several difficult issues such as the choice of tuning parameters or thresholding
values. In this paper, we investigate whether there are straightforward
methods that can achieve competitive performance without preliminary variable
selection. In addition, existing methods mainly focus on the “large p-small n”
case and the localized mean vector scenario (see below for the exact definition).
However, the “large p-large n” case with comparable magnitudes and the
delocalized scenario are common in high dimensional classification. The
classification rules proposed in this paper help to handle these situations.

Saranadasa (1993) proposes the determinant-based (D-) and trace-based
(T-) criteria. Their asymptotic misclassification probabilities are established
for normal populations. In this paper, we focus on the performance of these
two criteria in the delocalized scenario without the normal assumption. Specifically,
consider two p-dimensional multivariate populations $\Pi_1$ and $\Pi_2$ with
respective mean vectors $\mu_1$, $\mu_2$ and common covariance matrix $\Sigma$. The parameters
$\mu_1$, $\mu_2$ and $\Sigma$ are unknown and are thus estimated using training samples
$X = (x_1, x_2, \ldots, x_{n_1})^T$ from $\Pi_1$ and $Y = (y_1, y_2, \ldots, y_{n_2})^T$ from $\Pi_2$, with
respective sample sizes $n_1$ and $n_2$. A new observation vector $z = (z_1, z_2, \ldots, z_p)^T$
is known to belong to $\Pi_1$ or $\Pi_2$, and the aim is to identify its population of origin.
For more complicated sample settings, see Leung (2001), which considers
mixed continuous and discrete variables in each group. Cheng (2004) studies
the situation where the two populations have different covariance matrices.
Krzysko and Skorzybut (2009) consider multivariate repeated measures
data with Kronecker product covariance structures.

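Let $\bar{x}$ and $\bar{y}$ be the two training sample mean vectors, with components

$$\bar{x}_l = \frac{1}{n_1}\sum_{i=1}^{n_1} x_{il} \quad\text{and}\quad \bar{y}_l = \frac{1}{n_2}\sum_{j=1}^{n_2} y_{jl}, \qquad l = 1, 2, \ldots, p.$$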

If the vector z is classified to the population $\Pi_1$, then the overall within-group
sum of squares and cross products matrix is

$$A_1 = \sum_{i=1}^{n_1}(x_i - \bar{x})(x_i - \bar{x})' + \sum_{j=1}^{n_2}(y_j - \bar{y})(y_j - \bar{y})' + \frac{n_1}{n_1+1}(z - \bar{x})(z - \bar{x})'.$$

If instead z is classified to $\Pi_2$, then the matrix is

$$A_2 = \sum_{i=1}^{n_1}(x_i - \bar{x})(x_i - \bar{x})' + \sum_{j=1}^{n_2}(y_j - \bar{y})(y_j - \bar{y})' + \frac{n_2}{n_2+1}(z - \bar{y})(z - \bar{y})'.$$

Intuitively, one would decide $z \in \Pi_1$ when $A_1$ is in some sense “smaller” than
$A_2$. The D-criterion defines this smallness to be

$$\det(A_1) < \det(A_2), \tag{1}$$

and the T-criterion defines it to be

$$\mathrm{tr}(A_1) < \mathrm{tr}(A_2). \tag{2}$$
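The two rules (1) and (2) can be sketched directly from the definitions of $A_1$ and $A_2$. The following is a minimal illustration, not the authors' implementation: the function names are hypothetical, NumPy is assumed, and the log-determinant is used in place of the raw determinant purely for numerical stability.

```python
import numpy as np

def scatter_matrices(X, Y, z):
    """Return (A1, A2): the within-group scatter with z tentatively put in group 1 or 2."""
    n1, n2 = X.shape[0], Y.shape[0]
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - xbar, Y - ybar
    A = Xc.T @ Xc + Yc.T @ Yc                    # pooled within-group scatter
    dx, dy = z - xbar, z - ybar
    A1 = A + n1 / (n1 + 1) * np.outer(dx, dx)    # z assigned to population 1
    A2 = A + n2 / (n2 + 1) * np.outer(dy, dy)    # z assigned to population 2
    return A1, A2

def classify_d(X, Y, z):
    """D-criterion (1): return 1 if det(A1) < det(A2), else 2."""
    A1, A2 = scatter_matrices(X, Y, z)
    # slogdet avoids overflow of the determinant when p is large
    _, logdet1 = np.linalg.slogdet(A1)
    _, logdet2 = np.linalg.slogdet(A2)
    return 1 if logdet1 < logdet2 else 2

def classify_t(X, Y, z):
    """T-criterion (2): return 1 if tr(A1) < tr(A2), else 2."""
    A1, A2 = scatter_matrices(X, Y, z)
    return 1 if np.trace(A1) < np.trace(A2) else 2
```

Here `X` and `Y` are the training samples stored row-wise. The D-criterion needs the pooled scatter to be non-singular, which in practice requires p to be smaller than the sample size; the T-criterion has no such restriction.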

Two scenarios of the mean difference $\delta = \mu_2 - \mu_1$ are defined as follows:

1. Localized scenario: the difference $\delta$ is concentrated on a small number
of variables. We set $\mu_1 = 0$ and $\mu_2$ equal to a sparse vector: $\mu_2 =
(1_{n_0}', 0_{p-n_0}')'$, where $n_0$ is the sparsity size. Notice that the location of the
$n_0$ non-zero components does not influence the performance of the various
classifiers.

2. Delocalized scenario: the difference $\delta$ is dispersed over most of the variables.
To ease the comparison with the localized scenario, we choose the parameters
such that the averaged Mahalanobis distances are the same under these
two scenarios. This is motivated by the fact that, following Fisher (1936),
the difficulty of classification mainly depends on the Mahalanobis distance
$\Delta^2 = \delta'\Sigma^{-1}\delta$ between the two populations. More precisely, we set $\mu_1 = 0$ and
the elements of $\mu_2$ are randomly drawn from a uniform distribution specified in terms of $\Delta_L^2$ and $\beta$,
where

$$\Delta_L^2 = (1_{n_0}', 0_{p-n_0}')\,\Sigma^{-1}\,(1_{n_0}', 0_{p-n_0}')'$$

is the Mahalanobis distance under the localized scenario, and $\beta$ is a parameter
chosen to fulfill the requirement

$$E\Delta_D^2 = E\,\mu_2'\Sigma^{-1}\mu_2 = \Delta_L^2,$$

where $\Delta_D^2$ is the Mahalanobis distance under the delocalized scenario. Direct
calculations lead to

$$\beta^2 = \frac{p(p\rho - 14\rho + 13)}{12(1 - \rho + p\rho)(1 - \rho)}$$

for an equal correlation structure, $\Sigma_{ll'} = \rho$ for $l \ne l'$ and $\Sigma_{ll} = 1$. For an
autoregressive correlation structure, $\Sigma_{ll'} = \rho^{|l - l'|}$, we find

$$\beta^2 = \frac{p(24\rho - 13\rho^2 - 13) - 24\rho + 26\rho^2}{12(\rho^2 - 1)}.$$

Focusing on the delocalized scenario, a simulation study is conducted to
display the performance of the proposed procedures.
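The calibration of $\beta$ can be illustrated by a short worked computation. This is only a sketch under an assumption that is not stated in the text: the entries of $\mu_2$ are taken to be i.i.d. with mean $m$ and variance $v$ (both symbols are introduced here for illustration), and the standard formula for the expectation of a quadratic form is applied.

```latex
% Assumption (not from the text): entries of \mu_2 are i.i.d. with mean m, variance v.
\begin{align*}
E\,\mu_2'\Sigma^{-1}\mu_2
  &= v\,\mathrm{tr}(\Sigma^{-1}) + m^2\,\mathbf{1}_p'\Sigma^{-1}\mathbf{1}_p .
\intertext{For the equal correlation matrix $\Sigma = (1-\rho)I_p + \rho\,\mathbf{1}_p\mathbf{1}_p'$,}
\mathrm{tr}(\Sigma^{-1}) &= \frac{p(1 - 2\rho + p\rho)}{(1-\rho)(1-\rho+p\rho)},
\qquad
\mathbf{1}_p'\Sigma^{-1}\mathbf{1}_p = \frac{p}{1-\rho+p\rho},
\intertext{so that, taking $m = 1$ and $v = 1/12$,}
E\,\mu_2'\Sigma^{-1}\mu_2
  &= \frac{p(1 - 2\rho + p\rho) + 12p(1-\rho)}{12(1-\rho)(1-\rho+p\rho)}
   = \frac{p(p\rho - 14\rho + 13)}{12(1-\rho+p\rho)(1-\rho)} .
\end{align*}
```

With these assumed values the result coincides with the equal-correlation expression for $\beta^2$, which suggests that $\beta^2$ plays the role of the expected Mahalanobis distance before rescaling to $\Delta_L^2$. The same computation with the tridiagonal inverse of the autoregressive matrix reproduces the second expression.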

As the main contribution of this paper, we generalize the D- and T-criteria
from normal to general populations and establish their asymptotic misclassification
probabilities. As will be proven, the misclassification probability of
the D-criterion depends on the Mahalanobis distance between the two populations,
and the misclassification probability of the T-criterion depends
on the difference of the two group mean vectors and the skewness and kurtosis
coefficients of the two populations $\Pi_1$ and $\Pi_2$.

The rest of the paper is organized as follows. In Section 2, the asymptotic
misclassification probability of the D-criterion under general populations is
derived and Monte Carlo experiments are conducted to compare its performance
with that of several existing classification rules. In Section 3, the asymptotic
misclassification probability of the T-criterion under general populations is
derived, and a real data set is used to illustrate the competitive performance of the
T-criterion. The conclusion is given at the end of the paper. Technical proofs
are relegated to the appendix.


2 The D-criterion 

2.1 Data generation model 

Unlike the normal populations assumed in Saranadasa (1993), we assume that
the populations $\Pi_1$ and $\Pi_2$ have the general form introduced in Bai and
Saranadasa (1996), i.e.

(a) The population $X \sim \Pi_1$ has the form $X = \Gamma X^* + \mu_1$, where $\Gamma$ is
a $p \times p$ mixing or loading matrix, and $X^* = (x_l^*)_{1 \le l \le p}$ has p independent





On Two Simple and Effective Procedures for High Dimensional Classification 




and identically distributed, centered and standardized components. Moreover,
$\gamma_x = E|x_1^*|^4 < \infty$ and we set $\theta_x = E(x_1^{*3})$.

(b) Similarly, the population $Y \sim \Pi_2$ has the form $Y = \Gamma Y^* + \mu_2$, where
$Y^* = (y_l^*)_{1 \le l \le p}$ has p independent and identically distributed, centered and
standardized components. We set $\gamma_y = E|y_1^*|^4 < \infty$ and $\theta_y = E(y_1^{*3})$.

In consequence, the new observation is $z = \Gamma z^* + \mu_z$, where $z^* = X^*$ in
distribution and $\mu_z = \mu_1$ if $z \in \Pi_1$. Throughout the paper, we set $\tilde\mu =
\Gamma^{-1}\delta = (\tilde\mu_l)_{1 \le l \le p}$, $\alpha_1 = n_1/(n_1 + 1)$ and $\alpha_2 = n_2/(n_2 + 1)$.

Notice that the data-generation models (a)-(b) are quite general: they state
that the populations are linear combinations of some unobservable independent
components. They are also adopted in many recent studies on high-dimensional
statistics; see Chen et al. (2010), Li and Chen (2012) and Srivastava
et al. (2011), among others.
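Models (a)-(b) are easy to simulate, which may help make the role of $\Gamma$ concrete. The sketch below is illustrative only: the function names, the choice of component distributions and the rescaling of the Student's t components to unit variance are assumptions made here, not details taken from the paper.

```python
import numpy as np

def standardized_components(rng, size, dist="normal"):
    """Draw i.i.d. components with mean 0 and variance 1."""
    if dist == "normal":
        return rng.standard_normal(size)
    if dist == "gamma":
        # Gamma(shape=1, scale=1) has mean 1 and variance 1, so u - 1 is standardized
        return rng.gamma(shape=1.0, scale=1.0, size=size) - 1.0
    if dist == "t7":
        # Student's t with 7 degrees of freedom has variance 7/5; rescale to variance 1
        return rng.standard_t(df=7, size=size) / np.sqrt(7.0 / 5.0)
    raise ValueError(f"unknown dist: {dist}")

def sample_population(rng, n, Gamma, mu, dist="normal"):
    """Return an (n, p) array whose rows are Gamma @ x_star + mu."""
    p = Gamma.shape[0]
    X_star = standardized_components(rng, (n, p), dist)
    return X_star @ Gamma.T + mu
```

Under this construction each row has mean `mu` and covariance `Gamma @ Gamma.T`, so a target covariance can be obtained by passing, for example, its Cholesky factor as `Gamma`.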


2.2 Asymptotic misclassification probability 


The D-criterion (1) is easily seen to be equivalent to classifying z into $\Pi_1$ when

$$\alpha_1 (z - \bar{x})' A^{-1} (z - \bar{x}) < \alpha_2 (z - \bar{y})' A^{-1} (z - \bar{y}), \tag{3}$$

where

$$A = \sum_{i=1}^{n_1}(x_i - \bar{x})(x_i - \bar{x})' + \sum_{j=1}^{n_2}(y_j - \bar{y})(y_j - \bar{y})' \tag{4}$$

involves correlation information between variables. This criterion has a
straightforward form and does not need a preliminarily selected subset of features or
any thresholding parameter.
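The equivalence between (1) and (3) follows from the matrix determinant lemma, $\det(A + \alpha vv') = \det(A)(1 + \alpha v'A^{-1}v)$, applied to $A_1$ and $A_2$. A minimal sketch of the quadratic-form version is given below; the function name is hypothetical and NumPy is assumed.

```python
import numpy as np

def d_statistic(X, Y, z):
    """Left side minus right side of (3); a negative value assigns z to population 1."""
    n1, n2 = X.shape[0], Y.shape[0]
    xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - xbar, Y - ybar
    A = Xc.T @ Xc + Yc.T @ Yc                 # pooled scatter matrix (4)
    dx, dy = z - xbar, z - ybar
    # solve A u = d rather than forming the explicit inverse
    q1 = dx @ np.linalg.solve(A, dx)
    q2 = dy @ np.linalg.solve(A, dy)
    return n1 / (n1 + 1) * q1 - n2 / (n2 + 1) * q2
```

Compared with evaluating two determinants per new observation, this form only needs linear solves against a single matrix, which can be factorized once and reused for every new observation.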

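The associated error of misclassifying $z \in \Pi_1$ into $\Pi_2$ is

$$P(2|1) = P\{\alpha_1 (z - \bar{x})' A^{-1} (z - \bar{x}) - \alpha_2 (z - \bar{y})' A^{-1} (z - \bar{y}) > 0 \mid z \in \Pi_1\}. \tag{5}$$

Under the data-generation models (a) and (b), since $x_i = \Gamma x_i^* + \mu_1$ and $y_j =
\Gamma y_j^* + \mu_2$, we have $A = \Gamma \tilde{A} \Gamma'$, or $\Gamma' A^{-1} \Gamma = \tilde{A}^{-1}$, where

$$\tilde{A} = \sum_{i=1}^{n_1}(x_i^* - \bar{x}^*)(x_i^* - \bar{x}^*)' + \sum_{j=1}^{n_2}(y_j^* - \bar{y}^*)(y_j^* - \bar{y}^*)'. \tag{6}$$

The misclassification probability (5) is rewritten as

$$\begin{aligned}
P(2|1) &= P\{\alpha_1 (z^* - \bar{x}^*)' \Gamma' A^{-1} \Gamma (z^* - \bar{x}^*) \\
&\qquad - \alpha_2 (z^* - \bar{y}^* - \Gamma^{-1}\delta)' \Gamma' A^{-1} \Gamma (z^* - \bar{y}^* - \Gamma^{-1}\delta) > 0 \mid z \in \Pi_1\} \\
&= P\{\alpha_1 (z^* - \bar{x}^*)' \tilde{A}^{-1} (z^* - \bar{x}^*) \\
&\qquad - \alpha_2 (z^* - \bar{y}^* - \tilde\mu)' \tilde{A}^{-1} (z^* - \bar{y}^* - \tilde\mu) > 0 \mid z \in \Pi_1\}.
\end{aligned} \tag{7}$$

Here is the first main result of this paper.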
Theorem 1 Under the data-generation models (a) and (b), assume that the
following hold:

1. $p/n \to y \in (0,1)$ and $n_1/n \to \lambda \in (0,1)$, where $n = n_1 + n_2 - 2$;

2. $E|x_1^*|^{4+b'} < \infty$ and $E|y_1^*|^{4+b'} < \infty$ for some constant $b' > 0$.

Then as $p, n \to \infty$, the misclassification probability for the D-criterion
satisfies

$$\lim \{P(2|1) - \Phi(\vartheta_1)\} = 0, \tag{8}$$

where

$$\vartheta_1 = -\frac{\Delta}{2}\,\sqrt{1 - y}\,\Big[1 + \frac{y}{\lambda(1-\lambda)\Delta^2}\Big]^{-1/2}, \qquad \Delta^2 = \|\tilde\mu\|^2 = \delta'\Sigma^{-1}\delta,$$

and $\Delta^2$ is the Mahalanobis distance between the two populations $\Pi_1$ and $\Pi_2$.

The proof of the theorem is given in Appendix 1. The significance of the
result is as follows. The asymptotic value of P(2|1) depends on the values of
$y$, $\lambda$ and $\Delta^2$, and is independent of other characteristics of the distributions
$\Pi_1$ and $\Pi_2$. Firstly, this asymptotic value is symmetric in $\lambda$ and $1 - \lambda$, so the value
remains unchanged under a switch of the populations $\Pi_1$ and $\Pi_2$. Secondly,
if $n_1$ and $n_2$ do not differ greatly, i.e. $\lambda$ stays away from 0 and 1, the asymptotic
value of P(2|1) mainly depends on $\Delta$ when $y$ is fixed. In other words, the
classification task becomes easier for the D-criterion when the Mahalanobis
distance between the two populations increases, as expected. However, when $y \to 1$,
i.e. the number of features is very close to the sample size, the classification task
becomes harder for the D-criterion due to the instability of the inverse $A^{-1}$,
a phenomenon well known in the high-dimensional statistical literature.

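Under the normal assumption, Saranadasa (1993) derived another asymptotic
value for P(2|1):

$$\lim \{P(2|1) - \Phi(\vartheta_2)\} = 0, \qquad \vartheta_2 = -\frac{1}{2}\Delta\sqrt{1 - y}.$$

Notice that $\vartheta_1 = r \cdot \vartheta_2$, with

$$r = \frac{1}{\sqrt{1 + \dfrac{y}{\lambda(1-\lambda)\Delta^2}}}.$$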

Let us comment on the difference between $\Phi(\vartheta_1)$ and $\Phi(\vartheta_2)$. The value of
$\lambda$ does not influence the difference significantly. Without loss of generality,
let $\lambda = 1/2$. The factor $r$ is 1/2 when $y$ and $\Delta^2$ satisfy $y/\Delta^2 = 3/4$. Under this
setting, Figure 1 shows the asymptotic values $\Phi(\vartheta_1)$ and $\Phi(\vartheta_2)$ and compares them
to empirical values from simulations, as $y$ ranges from 0.1 to 0.9 with step 0.1.
The difference between the two values is clearly non-negligible, ranging
from 3.5% to 5.5%. Moreover, $\Phi(\vartheta_1)$ is much closer to the empirical values
than $\Phi(\vartheta_2)$, so our asymptotic result is more accurate. Other experiments
Fig. 1 Comparison between $\Phi(\vartheta_1)$ (solid), $\Phi(\vartheta_2)$ (dashes) and empirical values (dots) with
10,000 replications under normal samples, $n_1 = n_2 = 500$ and p ranging from 50 to 450 with
step 50.




Fig. 2 Comparison of $\Phi(\vartheta_1)$ (solid) with empirical values (dashes) under different scenarios,
with 10,000 replications for normal samples. Sample size n = 500, with $n_1 = n_2 = n/2$
for the left panel (a) and $n_1 = n/4$, $n_2 = 3n/4$ for the right panel (b).


have shown that the difference between them becomes negligible only when the
ratio $y/\Delta^2$ is as small as order $10^{-2}$.
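The relation between the two asymptotic values can be explored numerically. The sketch below assumes the form $\vartheta_1 = r\,\vartheta_2$ with $r = [1 + y/(\lambda(1-\lambda)\Delta^2)]^{-1/2}$ discussed above; the function names are hypothetical and only the Python standard library is used.

```python
import math

def std_normal_cdf(x):
    """Standard normal distribution function Phi(x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def theta2(delta2, y):
    """Value under normality attributed to Saranadasa (1993): -(Delta/2) * sqrt(1 - y)."""
    return -0.5 * math.sqrt(delta2) * math.sqrt(1.0 - y)

def shrink_factor(delta2, y, lam):
    """r = 1 / sqrt(1 + y / (lam * (1 - lam) * Delta^2))."""
    return 1.0 / math.sqrt(1.0 + y / (lam * (1.0 - lam) * delta2))

def theta1(delta2, y, lam):
    """theta_1 = r * theta_2, the argument of Phi in Theorem 1 (as reconstructed here)."""
    return shrink_factor(delta2, y, lam) * theta2(delta2, y)
```

For instance, `shrink_factor(0.4, 0.3, 0.5)` corresponds to the setting $y/\Delta^2 = 3/4$, $\lambda = 1/2$ used for Figure 1. Since $r < 1$ and $\vartheta_2 < 0$, $\Phi(\vartheta_1)$ always lies above $\Phi(\vartheta_2)$.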

Additional experiments are conducted to check the accuracy of the asymptotic
value $\Phi(\vartheta_1)$. Figure 2 compares the values of $\Phi(\vartheta_1)$ to empirical values
from simulations for normal samples. The empirical misclassification probabilities
are very close to the theoretical values of $\Phi(\vartheta_1)$. This holds for both the
$n_1 = n_2$ and $n_1/n_2 = 1/4$ situations.
Table 1 Comparison of the D-criterion with existing classifiers under the equal correlation
setting for normal samples: median of test classification error (with their standard errors in
parentheses)

ρ     D-Criterion  ROAD        S-ROAD1     S-ROAD2     NB           Oracle      T-Criterion
0     9.6(1.55)    9.4(2.91)   11.4(3.54)  9.6(3.24)   6.6(1.23)    5.6(1.13)   6.2(1.18)
0.1   9.2(1.52)    8.4(2.50)   8.6(2.58)   8.4(2.50)   12.4(1.57)   5.4(1.12)   12.4(1.57)
0.2   8.0(1.49)    7.2(2.39)   7.4(2.42)   7.2(2.39)   16.8(1.77)   4.4(1.06)   16.8(1.76)
0.3   6.4(1.37)    6.0(1.87)   6.0(1.86)   6.0(1.87)   20.2(1.88)   3.4(0.96)   20.2(1.87)
0.4   5.0(1.24)    4.6(1.55)   4.6(1.55)   4.6(1.55)   22.6(1.94)   2.4(0.82)   22.6(1.94)
0.5   3.4(1.04)    3.2(1.02)   3.2(1.02)   3.2(1.02)   24.6(2.00)   1.6(0.65)   24.6(1.99)
0.6   2.0(0.79)    1.8(0.73)   1.8(0.74)   1.8(0.73)   26.2(2.04)   0.8(0.46)   26.2(2.03)
0.7   0.8(0.51)    0.8(0.47)   0.8(0.47)   0.8(0.47)   27.4(2.06)   0.2(0.26)   27.4(2.05)
0.8   0.2(0.22)    0.2(0.20)   0.2(0.20)   0.2(0.20)   28.6(2.08)   0.0(0.09)   28.6(2.07)
0.9   0.0(0.02)    0.0(0.02)   0.0(0.02)   0.0(0.02)   29.6(2.10)   0.0(0.00)   29.6(2.10)


2.3 Monte Carlo experiments 

We conduct extensive tests to compare the D-criterion with several existing
classification methods for high-dimensional data: the ROAD method and its
variants S-ROAD1 and S-ROAD2, SCRDA, and the NB method, as well as the
oracle. The oracle is defined following Fan et al. (2012) as Fisher’s LDA
with the true mean and true covariance matrix.

In all simulation studies, the number of variables is p = 125, and the sample
sizes of the training and testing data in the two groups are $n_1 = n_2 = 250$. The
sparsity size is set to $n_0 = 10$. A similar setting is used in Fan et al. (2012).
The delocalized scenario is considered.
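A scaled-down version of such an experiment can be sketched as follows. This is not the authors' simulation code: it uses normal samples with identity covariance only, the sizes and function names are chosen here for illustration, and the asymptotic value is computed from the form of $\vartheta_1$ discussed in Section 2.2, which is an assumption of this sketch.

```python
import math
import numpy as np

def d_error_rate(rng, p, n1, n2, delta, n_train=20, n_test=200):
    """Empirical P(2|1) of the D-criterion for N(0, I_p) versus N(delta, I_p)."""
    a1, a2 = n1 / (n1 + 1), n2 / (n2 + 1)
    errors = 0
    for _ in range(n_train):
        X = rng.standard_normal((n1, p))
        Y = rng.standard_normal((n2, p)) + delta
        xbar, ybar = X.mean(axis=0), Y.mean(axis=0)
        Xc, Yc = X - xbar, Y - ybar
        A_inv = np.linalg.inv(Xc.T @ Xc + Yc.T @ Yc)
        Z = rng.standard_normal((n_test, p))      # new observations from population 1
        Dx, Dy = Z - xbar, Z - ybar
        q1 = np.einsum("ij,jk,ik->i", Dx, A_inv, Dx)   # row-wise quadratic forms
        q2 = np.einsum("ij,jk,ik->i", Dy, A_inv, Dy)
        errors += int(np.sum(a1 * q1 - a2 * q2 > 0))   # rule (3) sends z to population 2
    return errors / (n_train * n_test)

def asymptotic_error(delta2, y, lam):
    """Phi(theta_1) with theta_1 = -(Delta/2) sqrt(1-y) / sqrt(1 + y/(lam(1-lam)Delta^2))."""
    theta1 = -0.5 * math.sqrt(delta2 * (1 - y)) / math.sqrt(1 + y / (lam * (1 - lam) * delta2))
    return 0.5 * (1 + math.erf(theta1 / math.sqrt(2)))
```

A call such as `d_error_rate(np.random.default_rng(3), 50, 100, 100, np.full(50, 2 / math.sqrt(50)))` uses a delocalized mean difference with $\Delta^2 = 4$ and can be compared with `asymptotic_error(4.0, 50 / 198, 100 / 198)`.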


2.3.1 Equal correlation setting 

In this part, the covariance $\Sigma$ is set to be an equal correlation matrix and
the correlation coefficient $\rho$ ranges from 0 to 0.9 with step 0.1.

Simulation results for normal samples are shown in Table 1 and a graphical
summary is given in Figure 3, including the median classification errors and
standard errors. The D-criterion performs similarly to the ROAD in terms of
classification errors and is more robust than ROAD when $\rho$ is smaller than
0.5. The NB and the T-criterion lose efficiency when correlation exists in this
setting. Notice that the results of SCRDA calculated using the R package
provided by Guo et al. (2005) are not included. The package turns out to fail
in some of our settings and reports “NA” values. The percentage of failures in
the simulations can reach 58%. Therefore, it would be unreliable to include SCRDA
in the comparison.

Simulation results for Student’s t samples (with 7 degrees of freedom)
are shown in Table 2. All classifiers have slightly higher misclassification
rates for Student’s t samples. S-ROAD1 and S-ROAD2 have larger standard
errors, and S-ROAD1, NB and the T-criterion lose efficiency when correlation is
significant. The D-criterion outperforms the others except ROAD in terms of
Fig. 3 The median classification errors (panel a) and standard errors (panel b) for various methods under equal
correlation structure and delocalization: D-criterion (solid); ROAD (dash); S-ROAD2 (dot);
Oracle (cross).


Table 2 Comparison of the D-criterion with existing classifiers under the equal correlation
setting for Student’s t samples: median of test classification error (with their standard errors
in parentheses)

ρ     D-Criterion  ROAD        S-ROAD1      S-ROAD2     NB           Oracle      T-Criterion
0     12.0(1.55)   9.0(2.76)   9.0(2.80)    9.0(3.24)   9.1(1.29)    7.8(1.29)   8.6(1.24)
0.1   11.6(1.56)   9.8(3.11)   15.2(6.32)   11.6(3.61)  15.2(4.17)   7.6(1.27)   14.8(3.40)
0.2   10.4(1.48)   8.6(2.81)   19.6(6.76)   11.4(3.44)  19.2(7.00)   6.6(1.23)   19.0(5.79)
0.3   9.0(1.38)    7.4(2.36)   24.0(7.26)   10.6(3.00)  22.4(8.83)   5.6(1.16)   22.0(7.58)
0.4   7.6(1.27)    6.0(1.50)   27.6(8.06)   9.2(2.73)   24.8(10.15)  4.6(1.06)   24.2(8.99)
0.5   6.0(1.13)    4.8(1.00)   28.9(9.35)   7.8(2.26)   27.0(11.11)  3.4(0.91)   26.2(10.11)
0.6   4.4(0.97)    3.4(0.84)   29.2(10.83)  6.0(1.73)   29.0(11.90)  2.4(0.75)   27.6(11.02)
0.7   2.8(0.78)    2.0(0.65)   29.2(12.32)  4.0(1.26)   30.6(12.51)  1.4(0.57)   29.0(11.79)
0.8   1.2(0.53)    0.8(0.43)   28.8(13.74)  2.0(0.90)   32.0(13.01)  0.6(0.36)   30.2(12.44)
0.9   0.2(0.23)    0.2(0.20)   28.6(15.06)  0.4(0.39)   33.4(13.35)  0.0(0.14)   31.2(12.96)


classification error. Moreover, the D-criterion has the smallest standard error, which
is close to that of the oracle.


2.3.2 Autoregressive correlation setting 

In this part, the covariance $\Sigma$ is set to be an autoregressive correlation matrix
and $\rho$ ranges from 0 to 0.9 with step 0.1. The previous results have shown that
NB is not a good rule when significant correlation exists. Therefore, NB is
no longer included in the comparison. Since the comparison results are similar for
normal samples and Student’s t samples, we only use normal samples in this
part.

Simulation results are shown in Table 3 and a graphical summary is given
in Figure 4. The T-criterion is only suitable for the independent case $\rho = 0$, and
loses efficiency when $\rho > 0$. The D-criterion has the same performance as
Table 3 Comparison of the D-criterion with existing classifiers under the autoregressive
correlation setting: median of test classification error (with their standard errors in
parentheses)

ρ     D-Criterion  ROAD        S-ROAD1     S-ROAD2     Oracle      T-Criterion
0     9.6(1.55)    9.4(2.91)   11.6(3.54)  9.6(3.24)   5.6(1.13)   6.2(1.18)
0.1   11.8(1.68)   11.4(3.42)  12.8(3.67)  11.6(3.61)  0.0(0.09)   8.0(1.31)
0.2   14.2(1.80)   13.4(4.27)  14.4(4.02)  13.6(4.39)  0.0(0.15)   10.0(1.44)
0.3   16.4(1.89)   15.4(5.48)  16.0(4.61)  15.6(5.55)  0.4(0.33)   12.2(1.57)
0.4   18.6(1.99)   17.4(6.78)  17.8(5.95)  17.6(6.73)  1.8(0.64)   14.8(1.70)
0.5   20.8(2.07)   19.6(7.54)  20.0(7.29)  19.8(7.52)  4.6(1.02)   17.8(1.81)
0.6   22.6(2.16)   22.0(7.53)  22.6(7.34)  22.2(7.46)  8.6(1.38)   21.4(1.92)
0.7   23.6(2.26)   23.8(7.71)  26.0(7.54)  24.0(7.64)  12.6(1.71)  25.0(2.03)
0.8   22.8(2.38)   23.2(8.14)  30.6(7.67)  23.8(8.19)  14.6(1.94)  31.0(2.12)
0.9   17.0(2.39)   17.0(7.31)  33.4(9.13)  18.0(8.26)  11.4(1.93)  37.0(2.19)




Fig. 4 The median classification errors (panel a) and standard errors (panel b) for various methods under
autoregressive correlation structure: D-criterion (solid); ROAD (dash); S-ROAD2 (dot); Oracle
(cross).


ROAD and S-ROAD2 in terms of classification error. Moreover, the D-criterion
is much more robust and has a standard error close to that of the oracle.

In conclusion, compared to these existing methods, the D-criterion is
competitive in the “large p-large n” situation, specifically under the delocalized scenario
and the autoregressive correlation structure. In such a scenario, the D-criterion has
a classification error comparable to that of the ROAD-family classifiers while
being the most robust, with a much smaller standard error close to that of the
oracle.


3 The T-criterion 


Notice that one limitation of the D-criterion is that the dimension p must be
smaller than the sample size n. In addition, when the ratio p/n is close to 1,
the performance of this criterion deteriorates because the matrix A is close to
singular. The T-criterion, in contrast, does not have such a limitation.


3.1 Asymptotic misclassification probability 


The T-criterion (2) is easily seen to be equivalent to

$$\alpha_1 (z - \bar{x})'(z - \bar{x}) < \alpha_2 (z - \bar{y})'(z - \bar{y}). \tag{9}$$

Clearly, the T-criterion has a very simple form involving only the group
mean vectors. In particular, it does not require selecting a subset of features
or choosing a threshold parameter.

When $z \in \Pi_1$, the error of misclassifying z into $\Pi_2$ is

$$P(2|1) = P\{\alpha_1 (z - \bar{x})'(z - \bar{x}) - \alpha_2 (z - \bar{y})'(z - \bar{y}) > 0 \mid z \in \Pi_1\}. \tag{10}$$

Here is the second main result of this paper. Throughout the paper, $1_d$ is a
length d vector with all entries 1, and $0_d$ is a length d vector with all entries 0.
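Rule (9) follows from $\mathrm{tr}(A_k) = \mathrm{tr}(A) + \alpha_k \|z - \text{group mean}\|^2$ and needs nothing beyond the two sample means. A minimal sketch, with a hypothetical function name and NumPy assumed:

```python
import numpy as np

def classify_t_simple(X, Y, z):
    """T-criterion in the form (9): compare weighted squared distances to the group means."""
    n1, n2 = X.shape[0], Y.shape[0]
    dx = z - X.mean(axis=0)
    dy = z - Y.mean(axis=0)
    lhs = n1 / (n1 + 1) * (dx @ dx)   # alpha_1 * ||z - xbar||^2
    rhs = n2 / (n2 + 1) * (dy @ dy)   # alpha_2 * ||z - ybar||^2
    return 1 if lhs < rhs else 2
```

No matrix inversion is involved, so the rule remains usable when p exceeds the sample size.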


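Theorem 2 Under the data-generation models (a) and (b), assume that the
following hold:

1. $\gamma_{4+b',x} = E|x_1^*|^{4+b'} < \infty$ and $\gamma_{4+b',y} = E|y_1^*|^{4+b'} < \infty$ for some constant
$b' > 0$;

2. the covariance matrix $\Sigma$ is diagonal, i.e. $\Sigma = \mathrm{diag}(\sigma_{ll})_{1 \le l \le p}$;

3. $\sup_{p \ge 1} \{|\delta_l|, \sigma_{ll}\delta_l^2, \ l = 1, \ldots, p\} < \infty$; and

4. $\dfrac{\sum_{l=1}^{p} \sigma_{ll}^{2+b} + \sum_{l=1}^{p} \delta_l^{4+2b}}{\big(\sum_{l=1}^{p} \sigma_{ll}\delta_l^2\big)^{1+b}} \to 0$ as $p \to \infty$, where $b = b'/2$.

Then we have, as $p \to \infty$ and $n_* = \min(n_1, n_2) \to \infty$,

$$\lim \Big| P(2|1) - \Phi\Big(-\frac{\alpha_2\,\delta'\delta}{B_p}\Big) \Big| = 0, \tag{11}$$

where

$$B_p^2 = 4\Big(\frac{1}{n_1} + \frac{1}{n_2}\Big)\mathrm{tr}(\Sigma^2) + 4\theta_x\Big(\frac{1}{n_2} - \frac{1}{n_1}\Big) 1_p'\Sigma^{3/2}\delta + 4\Big(1 - \frac{1}{n_2}\Big)\delta'\Sigma\delta + o\Big(\frac{p}{n_*}\Big).$$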

The proof of the theorem is given in Appendix 2. Assumption 1 is needed for
dealing with non-normal populations. Assumption 3 is a weak and technical
condition without any practical limitation. Assumption 4 is satisfied for most
applications, where typically $\sum_{l=1}^{p} \sigma_{ll}^{2+b}$, $\sum_{l=1}^{p} \sigma_{ll}\delta_l^2$ and $\sum_{l=1}^{p} \delta_l^{4+2b}$ are all of
order p. The main term of $B_p^2$ is

$$B_p^2 \approx 4\delta'\Sigma\delta,$$

since it has the order $O(p)$ and the other terms are $O(p/n_*)$. In order to get a more
accurate result in the finite sample case, these $O(p/n_*)$ terms are kept in the
theorem.

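Notice that the main term of the approximation of P(2|1) depends on the
ratio $(\delta'\delta)/(2\sqrt{\delta'\Sigma\delta})$. If the components $\delta_l$ of $\delta$ satisfy $|\delta_l| \ge c > 0$, and
$0 < d_1 \le \lambda_{\min}(\Sigma) \le \lambda_{\max}(\Sigma) \le d_2$ for positive constants $c, d_1, d_2$, then when
$p \to \infty$,

$$\delta'\Sigma\delta \ge p\,d_1 c^2 \to \infty,$$

the ratio satisfies $(\delta'\delta)/(2\sqrt{\delta'\Sigma\delta}) \ge \frac{c}{2}\sqrt{p/d_2} \to \infty$, and

$$P(2|1) \to 0.$$

In other words, the classification task becomes easier when the dimension
grows. In other scenarios, this misclassification probability is not guaranteed
to vanish. For example, under a localized scenario where $\delta_1 = \cdots = \delta_{n_0} = c \ne 0$,
$\delta_l = 0$ for $l > n_0$, and $n_0$ is fixed and independent of p, we have

$$\frac{c}{2}\sqrt{\frac{n_0}{d_2}} \le \frac{\delta'\delta}{2\sqrt{\delta'\Sigma\delta}} \le \frac{c}{2}\sqrt{\frac{n_0}{d_1}},$$

i.e. $\liminf P(2|1) \ge \Phi\big(-\frac{c}{2}\sqrt{n_0/d_1}\big) > 0$, so the misclassification probability does not tend
to 0.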


Next, we provide some simulation results to demonstrate the importance
of keeping the $O(p/n_*)$ terms in $B_p^2$. The experiments use p = 500 and
various combinations of sample sizes $(n_1, n_2)$ with normal samples and gamma
samples, respectively. Empirical classification errors are compared in Figure 5
to the following three approximations of the variance $B_p^2$:

- $B_p^2(1) = 4\big(\frac{1}{n_1} + \frac{1}{n_2}\big)\mathrm{tr}(\Sigma^2) + 4\theta_x\big(\frac{1}{n_2} - \frac{1}{n_1}\big) 1_p'\Sigma^{3/2}\delta + 4\big(1 - \frac{1}{n_2}\big)\delta'\Sigma\delta$;

- $B_p^2(2) = 4\big(\frac{1}{n_1} + \frac{1}{n_2}\big)\mathrm{tr}(\Sigma^2) + 4\big(1 - \frac{1}{n_2}\big)\delta'\Sigma\delta$;

- $B_p^2(3) = 4\delta'\Sigma\delta$.

Among the three, the proposed approximation $B_p^2(1)$ matches the
empirical values very well, while $B_p^2(3)$ is by far the worst in all tested cases. As for
$B_p^2(1)$ and $B_p^2(2)$, they are by definition the same for normal samples (since
$\theta_x = 0$). For gamma samples, they remain close to each other, particularly when
the relative difference of sample sizes $(1/n_2 - 1/n_1)$ becomes small, and $B_p^2(1)$
has an overall slightly better performance than $B_p^2(2)$ (in these tested cases).
Notice that the standardized gamma variables are $x_l^* = u_l - 1$, where $u_l$ is
gamma distributed with unit shape and scale parameters, so that $\theta_x = 2$.
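The three approximations are simple to evaluate for a diagonal $\Sigma$. The sketch below is illustrative only: the function names are hypothetical, $\Sigma$ is passed as the list of its diagonal entries, and the error formula assumes that the normal approximation takes the form $\Phi(-\alpha_2\,\delta'\delta/B_p)$.

```python
import math

def bp2_approximations(sigma_diag, delta, n1, n2, theta_x=0.0):
    """Return (B_p^2(1), B_p^2(2), B_p^2(3)) for Sigma = diag(sigma_diag)."""
    tr_sigma2 = sum(s * s for s in sigma_diag)                        # tr(Sigma^2)
    d_sigma_d = sum(s * d * d for s, d in zip(sigma_diag, delta))     # delta' Sigma delta
    skew_term = sum(s ** 1.5 * d for s, d in zip(sigma_diag, delta))  # 1' Sigma^{3/2} delta
    common = 4 * (1 / n1 + 1 / n2) * tr_sigma2 + 4 * (1 - 1 / n2) * d_sigma_d
    b1 = common + 4 * theta_x * (1 / n2 - 1 / n1) * skew_term
    b2 = common
    b3 = 4 * d_sigma_d
    return b1, b2, b3

def t_error(delta, n2, bp2):
    """Approximate P(2|1) as Phi(-alpha_2 * delta'delta / B_p) (assumed form)."""
    a2 = n2 / (n2 + 1)
    dd = sum(d * d for d in delta)
    x = -a2 * dd / math.sqrt(bp2)
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))
```

Because $B_p^2(3)$ drops the positive $\mathrm{tr}(\Sigma^2)$ term, it yields a smaller variance and therefore a smaller predicted error than $B_p^2(2)$, the more so when $p/n_*$ is not small.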

Under the normal assumption, the expectation of $\sum_{l=1}^{p} k_l$ (defined in the
Appendix) is the same as in (17), and the variance simplifies to

$$B_p^2 = 4\Big(\frac{1}{n_1} + \frac{1}{n_2}\Big)\mathrm{tr}(\Sigma^2) + 4\Big(1 - \frac{1}{n_2}\Big)\delta'\Sigma\delta + o\Big(\frac{p}{n_*}\Big),$$

which coincides with the result established in Saranadasa (1993).
Fig. 5 The empirical values (solid) are compared to asymptotic values: dots ($B_p^2(1)$), dashes
($B_p^2(2)$) and dash-dots ($B_p^2(3)$), with 10,000 replications for normal samples (panel a) and gamma
samples, p = 500, $n_1$ ranging from 50 to 500 with step 50 and $n_2 = n_1 + 100$.


Table 4 The T-criterion under delocalization setting: median of test classification errors
(with standard errors in parentheses)

             p > n                             p < n
n_1 = n_2    100     150     200     250       300     350     400     450     500
median       13.00   11.00   9.75    9.00      8.50    8.14    7.88    7.56    7.40
s.e.         (2.52)  (1.90)  (1.57)  (1.35)    (1.20)  (1.11)  (1.01)  (0.95)  (0.89)


3.2 Monte Carlo experiments 


We conduct simulations to show the performance of the T-criterion for normal
distributions under the delocalized scenario. In the simulation studies, the number
of variables is p = 500. Without loss of generality, the sample sizes of the
training and testing data in the two groups are equal and range from 100 to 500
with step 50. The covariance $\Sigma$ is set to be the identity matrix $I_p$ and the
sparsity size is $n_0 = 10$.

Simulation results are shown in Table 4. The classification error decreases
as the sample size increases. Meanwhile, small standard errors indicate that the
T-criterion is robust with respect to the delocalized nature of the mean differences.
Notice that the T-criterion is an independence rule. It is suitable for cases where
variables are independent or the correlations between variables are weak. As
shown in Tables 1-3, the T-criterion has a very high misclassification rate when
variables have significant correlations.
Table 5 Classification error and number of used genes for the leukemia data

Method        Training error   Testing error   Number of genes used
T-criterion   0                2               7129
ROAD          0                1               40
SCRDA         1                2               264
FAIR          1                1               11
NSC           1                3               24
NB            0                5               7129


3.3 A real data analysis 

In this part, we analyze a popular gene expression data set: ‘leukemia’ (Golub et
al. 1999). The leukemia data set contains p = 7129 genes for $n_1 = 27$ acute
lymphoblastic leukemia and $n_2 = 11$ acute myeloid leukemia vectors in the
training set. The testing set includes 20 acute lymphoblastic leukemia and
14 acute myeloid leukemia vectors. Clearly, this data set is a “large p-small
n” case. The classification results for the T-criterion, ROAD, SCRDA, FAIR,
NSC and NB methods are shown in Table 5. (The results for ROAD, SCRDA,
FAIR, NSC and NB are taken from Fan et al. (2012).) The T-criterion is as
good as ROAD and NB in terms of training error. ROAD and FAIR perform
better than the T-criterion in terms of testing error. Both NB and the T-criterion
make use of all genes, but the T-criterion outperforms NB. Meanwhile, the T-criterion
performs better than NSC. Overall, on this data set, the T-criterion outperforms
SCRDA, NSC and NB, performs as well as FAIR, and is beaten only by ROAD (2
vs. 1 errors). It is quite surprising that a “simple-minded” rule like the T-criterion
has a performance comparable to a sophisticated rule like ROAD.


4 Conclusion 

We have proposed two new classification rules for high-dimensional data,
namely the D-criterion and the T-criterion. Both methods consider the overall
within-group sum of squares and cross products matrices. The D-criterion compares
the determinants of these matrices and integrates correlation information
between variables. The D-criterion performs well when correlations between
variables become significant. When the correlation coefficient increases, the
classification error of the D-criterion drops. The incorporation of covariance
structure therefore strengthens the effectiveness in high dimensional classification.
The T-criterion, on the other hand, compares the traces of these matrices
and involves only group mean vectors. The implementation of these two criteria
is straightforward and does not suffer from challenging issues such as
variable selection, thresholding or control of the sparsity size that are required
in the existing methods. We found that the D-criterion is particularly competitive in
the delocalized scenario. When p > n, the T-criterion is quite effective, as shown
by the real data analysis.
Moreover, using the explicit forms of the criteria and recent results from
random matrix theory, we are able to derive asymptotic approximations for the
misclassification probability of both criteria. Notice that such asymptotic
approximations are unknown for most of the existing high-dimensional classifiers
in the literature. Simulation results have shown that the proposed approximations
are quite accurate for both normal and non-normal populations.


A Appendix Technical proofs

A.1 Proof of Theorem 1


We first recall two known results on the Marčenko-Pastur distribution, which can be found
in Theorem 3.10 in Bai and Silverstein (2010) and Lemma 3.1 in Bai et al. (2009).

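Lemma 1 Assume $p/n \to y \in (0,1)$ as $n \to \infty$. For the sample covariance matrix $\tilde{S} =
\tilde{A}/n$, we have the following results:

(1)

$$\frac{1}{p}\mathrm{tr}(\tilde{S}^{-1}) \xrightarrow{a.s.} a_1, \qquad \frac{1}{p}\mathrm{tr}(\tilde{S}^{-2}) \xrightarrow{a.s.} a_2,$$

where $a_1 = \dfrac{1}{1 - y}$ and $a_2 = \dfrac{1}{(1 - y)^3}$;

(2) Moreover,

$$\frac{\bar{x}^{*\prime}\tilde{S}^{-i}\bar{x}^*}{\|\bar{x}^*\|^2} \xrightarrow{a.s.} a_i, \qquad \frac{\bar{y}^{*\prime}\tilde{S}^{-i}\bar{y}^*}{\|\bar{y}^*\|^2} \xrightarrow{a.s.} a_i, \qquad i = 1, 2.$$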


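Under the data-generation models (a) and (b), let $\Omega = (\tilde{A}, \bar{x}^*, \bar{y}^*)$. Conditioned on $\Omega$,
the misclassification probability (7) can be rewritten as

$$P_\Omega(2|1) = P(K > 0 \mid \Omega) = P_\Omega(K > 0),$$

where

$$K = \alpha_1 (z^* - \bar{x}^*)'\tilde{A}^{-1}(z^* - \bar{x}^*) - \alpha_2 (z^* - \bar{y}^* - \tilde\mu)'\tilde{A}^{-1}(z^* - \bar{y}^* - \tilde\mu).$$

Therefore, $P_\Omega(2|1) = P_\Omega(K > 0)$, where $z \in \Pi_1$ is assumed implicitly.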

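We evaluate the first two conditional moments of K.

Lemma 2 Let $\tilde{A}^{-1} = (b_{ll'})_{l,l'=1,\ldots,p}$. We have

(1)

$$M_p = E(K \mid \Omega) = (\alpha_1 - \alpha_2)\mathrm{tr}(\tilde{A}^{-1}) + \alpha_1 \bar{x}^{*\prime}\tilde{A}^{-1}\bar{x}^* - \alpha_2 (\bar{y}^* + \tilde\mu)'\tilde{A}^{-1}(\bar{y}^* + \tilde\mu); \tag{12}$$

(2)

$$\begin{aligned}
B_p^2 = \mathrm{Var}(K \mid \Omega) &= (\alpha_1 - \alpha_2)^2(\gamma_x - 3)\sum_l b_{ll}^2 + 2(\alpha_1 - \alpha_2)^2\mathrm{tr}(\tilde{A}^{-2}) + 4\alpha_1^2\,\bar{x}^{*\prime}\tilde{A}^{-2}\bar{x}^* \\
&\quad + 4\alpha_2^2 (\bar{y}^* + \tilde\mu)'\tilde{A}^{-2}(\bar{y}^* + \tilde\mu) + (4\alpha_1\alpha_2 - 4\alpha_2^2)\,\theta_x \sum_l b_{ll}\big(\tilde{A}^{-1}(\bar{y}^* + \tilde\mu)\big)_l \\
&\quad - 8\alpha_1\alpha_2\,\bar{x}^{*\prime}\tilde{A}^{-2}(\bar{y}^* + \tilde\mu) + (4\alpha_1\alpha_2 - 4\alpha_1^2)\,\theta_x \sum_l b_{ll}\big(\tilde{A}^{-1}\bar{x}^*\big)_l.
\end{aligned} \tag{13}$$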
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Proof of Lemma\^ It is easy to obtain the conditional expectation ( |l2| >. For the condi¬ 
tional variance of K , we first calculate the conditional second moment 

E(K 2 \Q) = Enfxl [z‘'A-V - 2x* , A~ 1 z* + x-'A- 1 **] 2 

+a|[z* , A~ 1 z* - 2(y* + /z)'A _1 z* + (y* + A)'A _1 (y* + A )] 2 
— 2aiQ!2[z* , A _1 z* — 2x*'A _1 z* + x*'A _1 x*] 

x [z*'A _1 z* — 2(y* + A)'A _ 1 z* + (y* + A)'A _1 (y* + A)]j- 

Since 

E n [z* , A- 1 z*] 2 = ( 7x - 3)^] ft 2 , + (trA -1 ) 2 + 2tr(A -2 ); 

l 

E n [z* , A _1 z* • x* , A _1 z*] ^0 x ^bu(A- 1 x*) i ; 

l 

E n [z*'A _1 z* • (y* + A)'A _1 z*] = 9 X ^ ft,, (A _1 (y* + A)),! 

l 

En [x^A-iz* • z^'A- 1 ^] = x*'A- 2 x*; 

En [(y* + A)'A _1 z* ■ z* , A _1 (y* + A)] = (y* + A)'A _2 (y* + A), 

we obtain 

E{K 2 \Q) = (ai - a 2 ) 2 (7* - 3) ^ ft?, + (a, - a 2 ) 2 (trCA” 1 )) 2 + 2( ai - a 2 ) 2 tr(A~ 2 ) 

l 

+4a?x*'A _2 x* + 4a|(y* + A)'A~ 2 (y* + A) - 8aia 2 x*'A _2 (y* + A) 

+2ai(oi - a 2 )tr-(A -1 )(x*'A~ 1 x*) + 2a 2 (a 2 - ai)tr(A _1 )(y* + A)'A _1 (y* + A) 
+4ai(a 2 - ai)9 x ’Y b u (a _ 1 x*) ; + 4a 2 (ai - a 2 )6 x ^ ft,, (A _1 (y* + A)) ; 

i i 

+ (aix*'A _1 x* - a 2 (y* + A)'A _1 (y* + A)) 2 - 

Finally, by 

Var(K\n) = E(K 2 \Q) - E 2 (K\Q), 

equation ( |13|) follows. The Lemma [ 2 ] is proved. 
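
The moment formulas (12) and (13) can be sanity-checked without any simulation error. The sketch below is only an illustration under stated assumptions: the coordinates of $z^*$ follow a two-point law with mean 0 and variance 1 (values 2 and -0.5 with probabilities 0.2 and 0.8), the matrix $A$, the vectors standing in for $\bar x^*$ and $\bar y^* + \tilde\mu$, and the constants `a1`, `a2` are arbitrary, and the function names are hypothetical. Because the law has finite support, the conditional mean and variance of $K$ can be enumerated exactly and compared with the closed forms.

```python
import itertools
import numpy as np

def lemma2_moments(B, x, w, a1, a2, theta, gamma):
    """Closed forms (12) and (13), with B = A^{-1}, x = xbar*, w = ybar* + mu."""
    d = a1 - a2
    b = np.diag(B)
    B2 = B @ B
    mean = d * np.trace(B) + a1 * (x @ B @ x) - a2 * (w @ B @ w)
    var = (d**2 * (gamma - 3.0) * np.sum(b**2)
           + 2.0 * d**2 * np.trace(B2)
           + 4.0 * a1**2 * (x @ B2 @ x)
           + 4.0 * a2**2 * (w @ B2 @ w)
           + (4.0 * a1 * a2 - 4.0 * a2**2) * theta * np.sum(b * (B @ w))
           - 8.0 * a1 * a2 * (x @ B2 @ w)
           + (4.0 * a1 * a2 - 4.0 * a1**2) * theta * np.sum(b * (B @ x)))
    return mean, var

def enumerated_moments(B, x, w, a1, a2, values, probs):
    """Exact mean and variance of K by enumerating every outcome of z."""
    p = B.shape[0]
    m1 = 0.0
    m2 = 0.0
    for idx in itertools.product(range(len(values)), repeat=p):
        idx = list(idx)
        z = values[idx]
        weight = np.prod(probs[idx])
        k = a1 * ((z - x) @ B @ (z - x)) - a2 * ((z - w) @ B @ (z - w))
        m1 += weight * k
        m2 += weight * k * k
    return m1, m2 - m1**2

rng = np.random.default_rng(1)
p = 6
M = rng.standard_normal((p, p))
A = M @ M.T + p * np.eye(p)          # arbitrary positive definite matrix
B = np.linalg.inv(A)
x = rng.standard_normal(p)           # stands in for xbar*
w = rng.standard_normal(p)           # stands in for ybar* + mu
a1, a2 = 0.9, 0.5                    # arbitrary constants for the check

values = np.array([2.0, -0.5])       # two-point law: mean 0, variance 1
probs = np.array([0.2, 0.8])
theta = float(np.sum(probs * values**3))   # third moment
gamma = float(np.sum(probs * values**4))   # fourth moment

formula_mean, formula_var = lemma2_moments(B, x, w, a1, a2, theta, gamma)
exact_mean, exact_var = enumerated_moments(B, x, w, a1, a2, values, probs)
print(formula_mean, exact_mean)
print(formula_var, exact_var)
```

The two printed pairs should agree up to floating-point rounding. A skewed, non-Gaussian law is used on purpose so that the terms involving $\theta_x$ and $\gamma_x - 3$ are actually exercised.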

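The first step of the proof of Theorem 1 is similar to that of the proof of Theorem 2, 
where we ensure that $K - E(K)$ satisfies the Lyapounov condition. The details are referred 
to \13\ . Therefore, conditioned on $\Omega$, as $n \to \infty$, the misclassification probability for the 
D-criterion satisfies 

$$P_\Omega(2|1) - \Phi\!\left(\frac{M_p}{B_p}\right) \to 0,$$

where $\Phi$ denotes the standard normal distribution function. 

Next, we look for the main terms in $M_p$ and $B_p^2$, respectively, using Lemma 1. For $M_p$, we find 
the following equivalents for the three terms: 

1. 
$$(\alpha_1 - \alpha_2)\operatorname{tr}(A^{-1}) = \frac{p}{n}(\alpha_1 - \alpha_2)\cdot\frac{1}{p}\operatorname{tr}(S^{-1}) = \frac{a_1}{n}\, p\left(\frac{1}{n_2+1} - \frac{1}{n_1+1}\right) + o\!\left(\frac{1}{n}\right);$$

2. 
$$\alpha_1\,\bar x^{*\prime}A^{-1}\bar x^* = \frac{a_1}{n}\cdot\alpha_1\|\bar x^*\|^2 + o\!\left(\frac{1}{n}\right);$$

3. 
$$\alpha_2(\bar y^* + \tilde\mu)'A^{-1}(\bar y^* + \tilde\mu) = \frac{a_1}{n}\cdot\alpha_2\|\bar y^* + \tilde\mu\|^2 + o\!\left(\frac{1}{n}\right).$$

Finally, 

$$M_p = \frac{a_1}{n}\left\{ p\left(\frac{1}{n_2+1} - \frac{1}{n_1+1}\right) + \alpha_1\|\bar x^*\|^2 - \alpha_2\|\bar y^* + \tilde\mu\|^2 \right\} + o\!\left(\frac{1}{n}\right). \tag{14}$$

As for $B_p^2$, we find the following equivalents for the seven terms: 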


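1. 
$$\Big|(\alpha_1 - \alpha_2)^2(\gamma_x - 3)\sum_l b_{ll}^2\Big| \le \frac{1}{n^2}\left(\frac{1}{n_2+1} - \frac{1}{n_1+1}\right)^2 |\gamma_x - 3| \cdot \operatorname{tr}(S^{-2});$$

2. 
$$2(\alpha_1 - \alpha_2)^2\operatorname{tr}(A^{-2}) = \frac{2}{n^2}\left(\frac{1}{n_2+1} - \frac{1}{n_1+1}\right)^2 \operatorname{tr}(S^{-2});$$

3. 
$$4\alpha_1^2\,\bar x^{*\prime}A^{-2}\bar x^* = 4\alpha_1^2\,\frac{a_2\|\bar x^*\|^2}{n^2} + o\!\left(\frac{1}{n^2}\right);$$

4. 
$$4\alpha_2^2(\bar y^* + \tilde\mu)'A^{-2}(\bar y^* + \tilde\mu) = 4\alpha_2^2\,\frac{a_2\|\bar y^* + \tilde\mu\|^2}{n^2} + o\!\left(\frac{1}{n^2}\right);$$

5. 
$$\begin{aligned}
\Big|4\alpha_2(\alpha_1 - \alpha_2)\,\theta_x\sum_l b_{ll}\big(A^{-1}(\bar y^* + \tilde\mu)\big)_l\Big|
&= \frac{4\alpha_2|\theta_x|}{n^2}\left|\frac{1}{n_2+1} - \frac{1}{n_1+1}\right| \Big|\sum_l (S^{-1})_{ll}\big(S^{-1}(\bar y^* + \tilde\mu)\big)_l\Big| \\
&\le \frac{4\alpha_2|\theta_x|}{n^2}\left|\frac{1}{n_2+1} - \frac{1}{n_1+1}\right| \Big(\sum_l (S^{-1})_{ll}^2\Big)^{1/2}\Big(\sum_l \big(S^{-1}(\bar y^* + \tilde\mu)\big)_l^2\Big)^{1/2} \\
&\le \frac{4\alpha_2|\theta_x|}{n^2}\left|\frac{1}{n_2+1} - \frac{1}{n_1+1}\right| \sqrt{p}\,\|\bar y^* + \tilde\mu\|\,a_2 + o\!\left(\frac{1}{n^2\sqrt{n}}\right);
\end{aligned}$$

6. 
$$8\alpha_1\alpha_2\,\big|\bar x^{*\prime}A^{-2}(\bar y^* + \tilde\mu)\big| = \frac{8\alpha_1\alpha_2}{n^2}\,\big|\bar x^{*\prime}S^{-2}(\bar y^* + \tilde\mu)\big|;$$

7. 
$$\Big|(4\alpha_1\alpha_2 - 4\alpha_1^2)\,\theta_x\sum_l b_{ll}\big(A^{-1}\bar x^*\big)_l\Big| \le \frac{4\alpha_1|\theta_x|}{n^2}\left|\frac{1}{n_2+1} - \frac{1}{n_1+1}\right| \sqrt{p}\,\|\bar x^*\|\,a_2 + o\!\left(\frac{1}{n^2\sqrt{n}}\right).$$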

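It can be proved that almost surely, 

$$\|\bar x^*\|^2 - \frac{p}{n_1} \to 0, \qquad \|\bar x^*\| - \sqrt{\frac{p}{n_1}} \to 0,$$

$$\|\bar y^* + \tilde\mu\|^2 - \frac{p}{n_2} - \|\tilde\mu\|^2 \to 0, \qquad \|\bar y^* + \tilde\mu\| - \sqrt{\frac{p}{n_2} + \|\tilde\mu\|^2} \to 0.$$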

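Then the terms 1 and 2 are of order $O(1/n^3)$ and the terms 5-7 are of order $o(1/n^2)$. Finally, 

$$B_p^2 = 4\alpha_1^2\,\frac{a_2\|\bar x^*\|^2}{n^2} + 4\alpha_2^2\,\frac{a_2\|\bar y^* + \tilde\mu\|^2}{n^2} + o\!\left(\frac{1}{n^2}\right).$$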


Since $n_1/n \to \lambda$, we have 

$$n_1 \sim n\lambda, \qquad n_2 \sim n(1 - \lambda).$$

Finally, it holds almost surely, 

$$\lim\,\{\,\cdots\,\} = 0. \tag{15}$$

This ends the proof of Theorem 1. 


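A.2 Proof of Theorem 2 

By assumption 2 in Theorem 2, the covariance matrix is $\Sigma = \operatorname{diag}(\sigma_{ll})_{1 \le l \le p}$. Under the 
data-generation models (a) and (b), the misclassification probability can be rewritten 
as 

$$\begin{aligned}
P(2|1) &= P\big\{\alpha_1 (z^* - \bar x^*)'\Sigma(z^* - \bar x^*) - \alpha_2 (z^* - \bar y^* - \tilde\mu)'\Sigma(z^* - \bar y^* - \tilde\mu) > 0 \,\big|\, z \in \Pi_1\big\} \\
&= P\Big(\sum_{l=1}^{p} k_l > 0 \,\Big|\, z \in \Pi_1\Big),
\end{aligned} \tag{16}$$

where 

$$k_l = \alpha_1 (z_l^* - \bar x_l^*)^2\,\sigma_{ll} - \alpha_2 (z_l^* - \bar y_l^* - \tilde\mu_l)^2\,\sigma_{ll}.$$

We first evaluate the first two moments of $k_l$. 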
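Lemma 3 Under the data-generation models (a) and (b), we have 

(1) 

$$E(k_l) = -\alpha_2\,\sigma_{ll}\,\tilde\mu_l^2,$$

and 

$$M_p = \sum_{l=1}^{p} E(k_l) = -\alpha_2\|\delta\|^2; \tag{17}$$

(2) 

$$\operatorname{Var}(k_l) = \sigma_{ll}^2\left\{\beta_0 + \beta_1(\gamma) + \beta_2(\theta)\,\tilde\mu_l + 4\alpha_2\,\tilde\mu_l^2\right\},$$

and 

$$B_p^2 = \sum_{l=1}^{p}\operatorname{Var}(k_l) = \left[\beta_0 + \beta_1(\gamma)\right]\operatorname{tr}(\Sigma^2) + \beta_2(\theta)\,\mathbf{1}'\Gamma^3\delta + 4\alpha_2\,\delta'\Sigma\delta, \tag{18}$$

where 

$$\beta_0 = \alpha_1^2\,\frac{6n_1^2 + 3n_1 - 3}{n_1^3} + \alpha_2^2\,\frac{6n_2^2 + 3n_2 - 3}{n_2^3} + 2(\alpha_1\alpha_2 - 1),$$

$$\beta_1(\gamma) = \gamma_x\left[\frac{\alpha_1^2}{n_1^3} + (\alpha_1 - \alpha_2)^2\right] + \frac{\alpha_2^2}{n_2^3}\,\gamma_y,$$

$$\beta_2(\theta) = 4\alpha_2(\alpha_1 - \alpha_2)\,\theta_x + \frac{4\alpha_2^2}{n_2^2}\,\theta_y.$$

If the small terms of order $O(p/n^2)$ are removed, then the formula of $B_p$ in Theorem 2 
is obtained. 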
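
The per-coordinate formulas of Lemma 3 can be verified exactly on a toy configuration. The sketch below is an illustration only: it assumes that each standardized coordinate follows a two-point law with mean 0 and variance 1 (one law for the first population and for $z^*$, another for the second population), takes the tiny sample sizes $n_1 = 3$, $n_2 = 2$ so that all outcomes can be enumerated, and uses $\alpha_i = n_i/(n_i+1)$. The function names and the numerical values of `sigma` and `mu` are hypothetical.

```python
import itertools
import numpy as np

def beta_terms(n1, n2, theta_x, theta_y, gamma_x, gamma_y):
    """alpha_1, alpha_2 and the constants beta_0, beta_1, beta_2 of Lemma 3."""
    a1, a2 = n1 / (n1 + 1), n2 / (n2 + 1)
    beta0 = (a1**2 * (6 * n1**2 + 3 * n1 - 3) / n1**3
             + a2**2 * (6 * n2**2 + 3 * n2 - 3) / n2**3
             + 2 * (a1 * a2 - 1))
    beta1 = gamma_x * (a1**2 / n1**3 + (a1 - a2) ** 2) + a2**2 * gamma_y / n2**3
    beta2 = 4 * a2 * (a1 - a2) * theta_x + 4 * a2**2 * theta_y / n2**2
    return a1, a2, beta0, beta1, beta2

def two_point(q):
    """Two-point law with mean 0 and variance 1: returns (values, probs)."""
    values = np.array([np.sqrt((1 - q) / q), -np.sqrt(q / (1 - q))])
    probs = np.array([q, 1 - q])
    return values, probs

def enumerated_k_moments(n1, n2, sigma, mu, law_x, law_y):
    """Exact mean and variance of one k_l by full enumeration."""
    a1, a2 = n1 / (n1 + 1), n2 / (n2 + 1)
    vx, px = law_x
    vy, py = law_y
    m1 = 0.0
    m2 = 0.0
    # index 0 is z (drawn from population 1), indices 1..n1 are the x-sample
    for ix in itertools.product(range(2), repeat=n1 + 1):
        ix = list(ix)
        z = vx[ix[0]]
        xbar = vx[ix[1:]].mean()
        wx = np.prod(px[ix])
        for iy in itertools.product(range(2), repeat=n2):
            iy = list(iy)
            ybar = vy[iy].mean()
            weight = wx * np.prod(py[iy])
            k = sigma * (a1 * (z - xbar) ** 2 - a2 * (z - ybar - mu) ** 2)
            m1 += weight * k
            m2 += weight * k * k
    return m1, m2 - m1**2

n1, n2 = 3, 2
sigma, mu = 1.7, 0.6                      # sigma_ll and mu_l, arbitrary
law_x, law_y = two_point(0.2), two_point(0.7)
theta_x = float(np.sum(law_x[1] * law_x[0] ** 3))
gamma_x = float(np.sum(law_x[1] * law_x[0] ** 4))
theta_y = float(np.sum(law_y[1] * law_y[0] ** 3))
gamma_y = float(np.sum(law_y[1] * law_y[0] ** 4))

a1, a2, beta0, beta1, beta2 = beta_terms(n1, n2, theta_x, theta_y, gamma_x, gamma_y)
formula_mean = -a2 * sigma * mu**2
formula_var = sigma**2 * (beta0 + beta1 + beta2 * mu + 4 * a2 * mu**2)
exact_mean, exact_var = enumerated_k_moments(n1, n2, sigma, mu, law_x, law_y)
print(formula_mean, exact_mean)
print(formula_var, exact_var)
```

Both pairs should coincide up to rounding. Using two different skewed laws makes the check sensitive to each of $\theta_x$, $\theta_y$, $\gamma_x$ and $\gamma_y$.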


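Proof of Lemma 3. Since $z^*$, $\bar x^*$ and $\bar y^*$ are independent, the variables $(k_l)_{l=1,\dots,p}$ 
are also independent. For the expectation of $k_l$, we have 

$$\begin{aligned}
E(k_l) &= \alpha_1\sigma_{ll}\cdot E(z_l^* - \bar x_l^*)^2 - \alpha_2\sigma_{ll}\cdot E(z_l^* - \bar y_l^* - \tilde\mu_l)^2 \\
&= \alpha_1\sigma_{ll}\cdot\alpha_1^{-1} - \alpha_2\sigma_{ll}\cdot(\alpha_2^{-1} + \tilde\mu_l^2) \\
&= -\alpha_2\,\sigma_{ll}\,\tilde\mu_l^2 .
\end{aligned}$$

Equation (17) follows. 

For the variance, we have 

$$\begin{aligned}
\operatorname{Var}(k_l) &= E[k_l - E(k_l)]^2 \\
&= \sigma_{ll}^2\cdot E\left\{\alpha_1(z_l^* - \bar x_l^*)^2 - \alpha_2(z_l^* - \bar y_l^* - \tilde\mu_l)^2 + \alpha_2\tilde\mu_l^2\right\}^2 \\
&= \sigma_{ll}^2\cdot\Big\{\alpha_1^2 E(z_l^* - \bar x_l^*)^4 + \alpha_2^2 E(z_l^* - \bar y_l^*)^4 + 4\alpha_2^2\tilde\mu_l^2\,E(z_l^* - \bar y_l^*)^2 \\
&\qquad\quad - 2\alpha_1\alpha_2\,E\big[(z_l^* - \bar x_l^*)^2(z_l^* - \bar y_l^*)^2\big] - 4\alpha_2^2\tilde\mu_l\,E(z_l^* - \bar y_l^*)^3 \\
&\qquad\quad + 4\alpha_1\alpha_2\tilde\mu_l\,E\big[(z_l^* - \bar x_l^*)^2(z_l^* - \bar y_l^*)\big]\Big\}.
\end{aligned}$$

We have 

$$E\big[z_l^* - \bar x_l^*\big]^4 = \gamma_x\left(1 + \frac{1}{n_1^3}\right) + \frac{6n_1^2 + 3n_1 - 3}{n_1^3}.$$

Moreover, 

$$E\big[z_l^* - \bar y_l^*\big]^4 = \gamma_x + \frac{\gamma_y}{n_2^3} + \frac{6n_2^2 + 3n_2 - 3}{n_2^3},$$

$$E\big[z_l^* - \bar y_l^*\big]^3 = \theta_x - \frac{\theta_y}{n_2^2},$$

$$E\big\{(z_l^* - \bar x_l^*)^2(z_l^* - \bar y_l^*)^2\big\} = \gamma_x + \frac{1}{\alpha_1\alpha_2} - 1,$$

and 

$$E\big\{(z_l^* - \bar x_l^*)^2(z_l^* - \bar y_l^*)\big\} = \theta_x .$$

Finally, we obtain 

$$\begin{aligned}
\operatorname{Var}(k_l) &= \sigma_{ll}^2\Big\{\alpha_1^2\Big[\gamma_x\Big(1 + \frac{1}{n_1^3}\Big) + \frac{6n_1^2 + 3n_1 - 3}{n_1^3}\Big] + \alpha_2^2\Big[\gamma_x + \frac{\gamma_y}{n_2^3} + \frac{6n_2^2 + 3n_2 - 3}{n_2^3}\Big] \\
&\qquad\quad + 4\alpha_2^2\tilde\mu_l^2\,\alpha_2^{-1} - 2\alpha_1\alpha_2\Big[\gamma_x + \frac{1}{\alpha_1\alpha_2} - 1\Big] + 4\alpha_1\alpha_2\tilde\mu_l\theta_x - 4\alpha_2^2\tilde\mu_l\Big(\theta_x - \frac{\theta_y}{n_2^2}\Big)\Big\} \\
&= \sigma_{ll}^2\Big\{\gamma_x\Big(\alpha_1^2 + \frac{\alpha_1^2}{n_1^3} + \alpha_2^2 - 2\alpha_1\alpha_2\Big) + \frac{\alpha_2^2}{n_2^3}\gamma_y + \alpha_1^2\,\frac{6n_1^2 + 3n_1 - 3}{n_1^3} + \alpha_2^2\,\frac{6n_2^2 + 3n_2 - 3}{n_2^3} \\
&\qquad\quad - 2 + 4\alpha_2\tilde\mu_l^2 + 2\alpha_1\alpha_2 + \Big[4\alpha_2(\alpha_1 - \alpha_2)\theta_x + \frac{4\alpha_2^2}{n_2^2}\theta_y\Big]\tilde\mu_l\Big\} \\
&= \sigma_{ll}^2\left\{\beta_0 + \beta_1(\gamma) + \beta_2(\theta)\,\tilde\mu_l + 4\alpha_2\,\tilde\mu_l^2\right\}.
\end{aligned}$$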

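Equation (18) follows. Then $B_p^2$ can be rewritten as 

$$\begin{aligned}
B_p^2 ={}& \Big[\frac{6n_1 + 3}{(n_1+1)^2} + \frac{6n_2 + 3}{(n_2+1)^2} - \frac{2}{n_1+1} - \frac{2}{n_2+1} + \frac{2}{(n_1+1)(n_2+1)} - \frac{3}{n_1(n_1+1)^2} - \frac{3}{n_2(n_2+1)^2} \\
&\ + \frac{\gamma_x}{(n_1+1)^2} + \frac{\gamma_x}{(n_2+1)^2} - \frac{2\gamma_x}{(n_1+1)(n_2+1)} + \frac{\gamma_x}{n_1(n_1+1)^2} + \frac{\gamma_y}{n_2(n_2+1)^2}\Big]\operatorname{tr}(\Sigma^2) \\
&+ \Big[\frac{4n_2}{n_2+1}\Big(\frac{1}{n_2+1} - \frac{1}{n_1+1}\Big)\theta_x + \frac{4}{(n_2+1)^2}\,\theta_y\Big]\mathbf{1}'\Gamma^3\delta + \frac{4n_2}{n_2+1}\,\delta'\Sigma\delta \\
={}& \Big[\frac{4}{n_1} + \frac{4}{n_2} + O\big(n_*^{-2}\big)\Big]\operatorname{tr}(\Sigma^2) + 4\Big[\Big(\frac{1}{n_2} - \frac{1}{n_1}\Big)\theta_x + O\big(n_*^{-2}\big)\Big]\mathbf{1}'\Gamma^3\delta + 4\Big(1 - \frac{1}{n_2} + O\big(n_2^{-2}\big)\Big)\delta'\Sigma\delta .
\end{aligned}$$

Keeping only the terms of order $O(p)$ and $O(p/n_*)$, we get the formula of $B_p$ in 
Theorem 2. Lemma 3 is proved. 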

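We know that $[k_l - E(k_l)]_{1 \le l \le p}$ are independent variables with zero mean. We use the 
Lyapounov criterion to establish a CLT for $\sum_l [k_l - E(k_l)]$, that is, there is a constant $b > 0$ 
such that 

$$\lim_{p\to\infty} B_p^{-(2+b)}\sum_{l=1}^{p} E\big|k_l - E(k_l)\big|^{2+b} = 0.$$

Since 

$$\begin{aligned}
|k_l - E(k_l)| &= \sigma_{ll}\big|\alpha_1(z_l^* - \bar x_l^*)^2 - \alpha_2(z_l^* - \bar y_l^*)^2 + 2\alpha_2\tilde\mu_l(z_l^* - \bar y_l^*)\big| \\
&\le \sigma_{ll}\big\{|z_l^* - \bar x_l^*|^2 + |z_l^* - \bar y_l^*|^2 + 2|\tilde\mu_l|\,|z_l^* - \bar y_l^*|\big\} \\
&\le \sigma_{ll}\big\{|z_l^* - \bar x_l^*|^2 + 2|z_l^* - \bar y_l^*|^2 + |\tilde\mu_l|^2\big\} \\
&\le \sigma_{ll}\big\{2\big(|z_l^*|^2 + |\bar x_l^*|^2\big) + 4\big(|z_l^*|^2 + |\bar y_l^*|^2\big) + |\tilde\mu_l|^2\big\} \\
&\le \sigma_{ll}\big\{6\big(|z_l^*|^2 + |\bar x_l^*|^2 + |\bar y_l^*|^2\big) + |\tilde\mu_l|^2\big\},
\end{aligned}$$

the $(2+b)$-norm of $[k_l - E(k_l)]$ satisfies 

$$\begin{aligned}
\|k_l - E(k_l)\|_{2+b} &\le \sigma_{ll}\Big\{6\Big[\big\||z_l^*|^2\big\|_{2+b} + \big\||\bar x_l^*|^2\big\|_{2+b} + \big\||\bar y_l^*|^2\big\|_{2+b}\Big] + |\tilde\mu_l|^2\Big\} \\
&= \sigma_{ll}\Big\{6\Big[\big(E|z_l^*|^{4+2b}\big)^{\frac{1}{2+b}} + \big(E|\bar x_l^*|^{4+2b}\big)^{\frac{1}{2+b}} + \big(E|\bar y_l^*|^{4+2b}\big)^{\frac{1}{2+b}}\Big] + |\tilde\mu_l|^2\Big\} \\
&\le \sigma_{ll}\Big\{6\Big[2\big(E|z_l^*|^{4+2b}\big)^{\frac{1}{2+b}} + \big(E|y_{1l}^*|^{4+2b}\big)^{\frac{1}{2+b}}\Big] + |\tilde\mu_l|^2\Big\},
\end{aligned}$$

where $y_{1l}^*$ denotes the $l$-th coordinate of a single standardized observation of the second 
sample, and the last step bounds the norm of each sample mean by that of a single 
observation (Minkowski's inequality). Then 

$$E\big|k_l - E(k_l)\big|^{2+b} \le c_b\,\sigma_{ll}^{2+b}\cdot\big\{1 + |\tilde\mu_l|^{4+2b}\big\},$$

where $c_b$ is some constant depending on $b$. Therefore, as $B_p^2 \gtrsim 4\delta'\Sigma\delta = 4\sum_l \sigma_{ll}\delta_l^2$, 

$$\begin{aligned}
B_p^{-(2+b)}\sum_{l=1}^{p} E\big|k_l - E(k_l)\big|^{2+b} &\le C_b\cdot\frac{\sum_l\big(\sigma_{ll}^{2+b} + \sigma_{ll}^{2+b}|\tilde\mu_l|^{4+2b}\big)}{\big(\sum_l \sigma_{ll}^2\tilde\mu_l^2\big)^{1+b/2}} \\
&= C_b\cdot\frac{\sum_l \sigma_{ll}^{2+b} + \sum_l |\delta_l|^{4+2b}}{\big(\sum_l \sigma_{ll}\delta_l^2\big)^{1+b/2}} \to 0,
\end{aligned}$$

by assumption 4 in Theorem 2, where $C_b$ is again a constant depending only on $b$. Finally, we have 

$$B_p^{-1}\sum_{l=1}^{p}\big[k_l - E(k_l)\big] \Rightarrow N(0,1), \quad \text{as } p \to \infty,\ n_* \to \infty.$$

This ends the proof of Theorem 2. 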
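
As an informal illustration of how the CLT above is used, the normal approximation $\Phi(M_p/B_p)$, with $M_p$ from (17) and $B_p^2$ from (18), can be compared with a Monte Carlo estimate of $P(\sum_l k_l > 0)$. The sketch below is not the simulation design of the paper. It assumes Gaussian populations (so $\theta_x = \theta_y = 0$, $\gamma_x = \gamma_y = 3$, and the standardized sample means can be drawn directly as $N(0, 1/n_i)$), and the dimension, sample sizes, variances and shifts are arbitrary choices made only for this example.

```python
import math
import numpy as np

def normal_cdf(t):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def approx_error(n1, n2, sigma, mu):
    """Phi(M_p / B_p) in the Gaussian case, per coordinate sums of (17)-(18)."""
    a1, a2 = n1 / (n1 + 1), n2 / (n2 + 1)
    beta0 = (a1**2 * (6 * n1**2 + 3 * n1 - 3) / n1**3
             + a2**2 * (6 * n2**2 + 3 * n2 - 3) / n2**3
             + 2 * (a1 * a2 - 1))
    # gamma_x = gamma_y = 3; theta_x = theta_y = 0, so beta_2 vanishes
    beta1 = 3.0 * (a1**2 / n1**3 + (a1 - a2) ** 2) + 3.0 * a2**2 / n2**3
    m_p = -a2 * np.sum(sigma * mu**2)
    b2_p = np.sum(sigma**2 * (beta0 + beta1 + 4 * a2 * mu**2))
    return normal_cdf(m_p / math.sqrt(b2_p))

def simulated_error(n1, n2, sigma, mu, reps, seed=0):
    """Monte Carlo estimate of P(sum_l k_l > 0) for z drawn from population 1."""
    rng = np.random.default_rng(seed)
    p = sigma.size
    a1, a2 = n1 / (n1 + 1), n2 / (n2 + 1)
    z = rng.standard_normal((reps, p))
    # Gaussian assumption: standardized sample means are exactly N(0, 1/n_i)
    xbar = rng.standard_normal((reps, p)) / math.sqrt(n1)
    ybar = rng.standard_normal((reps, p)) / math.sqrt(n2)
    k = sigma * (a1 * (z - xbar) ** 2 - a2 * (z - ybar - mu) ** 2)
    return float(np.mean(k.sum(axis=1) > 0))

p, n1, n2 = 100, 20, 20
sigma = np.linspace(0.5, 1.5, p)     # diagonal entries sigma_ll (arbitrary)
mu = np.full(p, 0.2)                 # standardized mean shifts (arbitrary)

approx = approx_error(n1, n2, sigma, mu)
simulated = simulated_error(n1, n2, sigma, mu, reps=10000)
print(f"normal approximation: {approx:.3f}")
print(f"Monte Carlo estimate: {simulated:.3f}")
```

In this configuration the two numbers should be close; the remaining gap reflects both Monte Carlo noise and the finite-$p$ error of the normal approximation.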


References 

1. Bai Z, Liu H, Wong WK (2009) Enhancement of the applicability of Markowitz’s portfolio 
optimization by utilizing random matrix theory. Math Financ 19:639-667 

2. Bai Z, Saranadasa H (1996) Effect of high dimension: by an example of a two sample 
problem. Stat Sin 6:311-329 

























3. Bai Z, Silverstein JW (2010) Spectral analysis of large dimensional random matrices. 
Science Press, Beijing 

4. Bickel P, Levina E (2004) Some theory for Fisher’s linear discriminant function ‘naive 
Bayes’, and some alternatives when there are many more variables than observations. 
Bernoulli 10:989-1010 

5. Chen SX, Zhang LX, Zhong PS (2010) Tests for high dimensional covariance matrices. 
J Am Stat Assoc 105:810-819 

6. Cheng Y (2004) Asymptotic probabilities of misclassification of two discriminant 
functions in cases of high dimensional data. Stat Probab Lett 67:9-17 

7. Fan J, Fan Y (2008) High dimensional classification using features annealed independence 
rules. Ann Stat 36:2605-2637 

8. Fan J, Feng Y, Tong X (2012) A road to classification in high dimensional space: the 
regularized optimal affine discriminant. J R Stat Soc Series B 74:745-771 

9. Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 
7:179-188 

10. Guo Y, Hastie T, Tibshirani R (2005) Regularized discriminant analysis 
and its application in microarrays. Biostat 1:1-18. R package downloadable at 
http://cran.r-project.org/web/packages/ascrda/ 

11. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh 
ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification 
of cancer: class discovery and class prediction by gene expression monitoring. Sci 286:531- 
537 

12. Lange T, Mosler K, Mozharovskyi P (2014) Fast nonparametric classification based 
on data depth. Stat Pap 55:49-69 

13. Leung CY (2001) Error rates in classification consisting of discrete and continuous 
variables in the presence of covariates. Stat Pap 42:265-273 

14. Li J, Chen SX (2012) Two sample tests for high dimensional covariance matrices. Ann 
Stat 40:908-940 

15. Krzysko M, Skorzybut M (2009) Discriminant analysis of multivariate repeated 
measures data with a Kronecker product structured covariance matrices. Stat Pap 50:817-835 

16. Saranadasa H (1993) Asymptotic expansion of the misclassification probabilities of D- 
and A-criteria for discrimination from two high dimensional populations using the theory 
of large dimensional random matrices. J Multivar Anal 46:154-174 

17. Shao J, Wang Y, Deng X, Wang S (2011) Sparse linear discriminant analysis by 
thresholding for high dimensional data. Ann Stat 39:1241-1265 

18. Srivastava MS, Kollo T, Rosen D (2011) Some tests for the covariance matrix with fewer 
observations than the dimension under non-normality. J Multivar Anal 102:1090-1103 

19. Tibshirani R, Hastie T, Narasimhan B, Chu G (2002) Diagnosis of multiple cancer types 
by shrunken centroids of gene expression. Proc Natl Acad Sci USA 99:6567-6572 

20. Vapnik VN (1995) The nature of statistical learning theory. New York, Springer 



